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Abstract 



The paper addresses the problem of low-rank trace norm minimization. We propose 
an algorithm that alternates between fixed-rank optimization and rank-one updates. The 
fixed-rank optimization is characterized by an efficient factorization that makes the trace 
norm differentiable in the search space and the computation of duality gap numerically 
tractable. The search space is nonlinear but is equipped with a particular Riemannian 
structure that leads to efficient computations. We present a second-order trust-region 
algorithm with a guaranteed quadratic rate of convergence. Overall, the proposed op- 
timization scheme converges super-linearly to the global solution while still maintaining 
complexity that is linear in the number of rows of the matrix. To compute a set of solu- 
tions efhciently for a grid of regularization parameters we propose a predictor-corrector 
approach on the quotient manifold that outperforms the naive warm-restart approach. 
The performance of the proposed algorithm is illustrated on problems of low-rank matrix 
completion and multivariate linear regression. 

1 Introduction 

The present paper focuses on the convex program 



where / is a smooth convex function, ||X||^, is the trace norm (also known as nuclear norm) 
which is the sum of the singular values of X [15^ \29\ [12] and A > is the regularization 
parameter. Programs of this type have attracted much attention in the recent years as efficient 
convex relaxations of intractable rank minimization problems |15] . The rank of the optimal 
solution X*(A) of ([T]) decreases to zero as the regularization parameter grows unbounded [S] 
and therefore, generating efficiently the regularization path {X*(Aj)}j=i^...^7V5 for a whole range 
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of values of Aj minimizers, is a convenient proxy to obtain suboptimal low-rank minimums of 
/■ 

Motivated by machine learning and statistical large-scale regression problems |29l [371 135] , 
we are interested in very low-rank solutions (p < 10^) of very high-dimensional problems 
[n > 10^). To this end, we propose an algorithm that guarantees second-order convergence to 
the solutions of ([T]) while ensuring a tight control (linear in n) on the data storage requirements 
and on the numerical complexity of each iteration. 

The proposed algorithm is based on a low-rank factorization of the unknown matrix, 
similar to the sing ular value decomposition (SVD), X = UBV^. Like in SVD, U € M'^^p and 
V S M™'^^ are orthonormal matrices that span row and column spaces of X. In contrast, the 
p X p scaling factor B = ;^ is allowed to be non-diagonal which makes the factorization 
non-unique. 

Our algorithm alternates between fixed-rank optimization and rank-one updates. When 
the rank is fixed, the problem is no longer convex but the search space has nevertheless a 
Riemannian structure. We use the framework of manifold optimization to devise a trust- 
region algorithm that generates at low-cost (linear in n) iterates that converge super-linearly 
to a local minimum. Local minima are then escaped by incrementing the rank until the global 
minimum in reached. The rank-one update is always selected to ensure a decrease of the cost. 

Implementing the complete algorithm for a fixed value of the regularization parameter A 
leads to a monotone convergence to the global minimum through a sequence of local minima of 
increasing ranks. Rather, we also modify A along the way with a predictor-corrector method 
thereby transforming most local minima of ([T|) (for fixed A and fixed rank) into global minima 
of ([I]) for different values of A. The resulting procedure, thus, provides a full regularization 
path at a very efficient numerical cost. 

Not surprisingly, the proposed approach has links with several earlier contributions in the 
literature. Primarily, the idea of interlacing fixed-rank optimization with rank-one updates 
has been used in semidefinite programming |1H [T7] . It is here extended to a non-symmetric 
framework using the Riemannian geometry recently developed in [71 [23l [25] . An improvement 
with respect to earlier work [TTl[l7j is the use of duality gap certificate to discriminate between 
local and global minima and its efficient computation thanks to the chosen parametrization. 

Schemes that combine manifold optimization and special rank-one updates have appeared 
recently in the particular context of matrix completion [181 136j . The framework presented 
here is in the same spirit but in a more general setting and with a global convergence analysis. 
Most other fixed-rank algorithms for matrix completion are first-order schemes |321 \TE[ [22l [30l 
[MlllSl 13 [M] . It is more difficult to provide a tight comparison of the proposed algorithm to 
trace norm minimization algorithms that do not fix the rank a priori [121 l2Tl [371 S] ■ It should 
be emphasized, however, that most trace norm minimization algorithms use singular value 
thresholding operation at each iteration. This is the most numerically demanding step for 
these algorithms. For the matrix completion application, it involves computing (potentially 
all) the singular values of a low-rank + sparse matrix. In contrast, the proposed approach 
only involves rank-one updates. Furthermore, only a few number of such updates are used. 
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For the sake of illustration and empirical comparison with state-of-the-art algorithms we 
consider two particular applications, low-rank matrix completion [13] and multivariate linear 
regression [37]. In both cases, we obtain iterative algorithms with a numerical complexity 
that is linear in the number of observations and with favorable convergence and precision 
properties. 

2 Relationship between convex program and non-convex 
formulation 

Among the different factorization that exist to represent low-rank matrices, we use the polar 
factorization [25^ [7] that decomposes a p — rank matrix X G M"^™- into 



where U G St(p, n), V G St{p,m) and B G S-^-+{p). St(p, n) is the Stiefel manifold or the set 
of n X p matrices with orthonormal columns. S^-^-{p) is the cone of p x p positive definite 
matrices. We stress that the scaling B = B"^ ^ is not required to be diagonal. The 
redundancy of this parametrization has non-trivial algorithmic implications (see Section [3|) 
but we believe that it is key to success of the approach (see \18\ [25] for earlier algorithms 
advocating matrix scaling and Section [6. II for a numerical illustration). With the use of polar 
factorization, the trace norm is written as 



that makes it differentiable and numerically tractable. The non-convex formulation of ([T|) is 
thus recast as 



In addition, the search space is not Euclidean but the product space of two well-studied 
manifolds, namely, the Stiefel manifold [14] and the cone of positive definite matrices [31] . 
This provides a proper geometric framework to perform optimization. From the geometric 
point of view, the column and row spaces of X are represented on the Stiefel manifold whereas 
the scaling factor is absorbed into the positive definite part. A proper metric on the space 
takes into account both rotational and scaling invariance. The usefulness of this separation, 
of subspaces and scaling, has been for instance demonstrated in [18] where the authors use a 
similar factorization scheme for the matrix completion problem. 

2.1 First-order optimality conditions 

In order to relate the solution of ([2]) to that of the convex optimization problem ([TJ we look at 
the necessary and sufficient optimality conditions that govern the solutions. The first-order 
necessary and sufficient optimality condition for the convex program ([1]) is 



X = UBV^ 



X||* = Trace(B) 



minu,B,v /(UBV^) + ATVace(B) 

subject to UGSt(p, n), B G ^^^(p) and VGSt(p,m). 



(2) 



G Gradx/(X) + Aa||X 



(3) 
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where Gradx/ is the Euchdean gradient of / in M"^"* at X and 5||X||* is the sub-differential 
of the trace norm. Details about optimality conditions for trace norm are given in [U 129] . 
The first-order necessary conditions ([2]) are [3l [271 EH] 

SVB - USym(U^SVB) = 

Sym(U^SV + AI) = (4) 

S^UB - VSym(V'^S'^UB) = 

where Sym(A) = ^^^^ and S = Gradx/(UBV"^). S is referred to as dual variable through- 
out the paper. The first-order optimality conditions can be derived either by writing the La- 
grangian of the problem ([2]) and looking at the KKT conditions or by deriving the gradient of 
the function on the structured space St(p, n) x S^j^{p) x St(p, m) using a metric (fTTj) defined 
in Section [3l 

Proposition 2.1. A local minimum of (0j X = UBV^ is also the global optimum of ^ iff 

II S Hop = A 

where S = Gradx/(UBV"^) and ||S||op is the operator norm, i.e., the dominant singular value 
ofS. Moreover, ||S|| op > A and equality holds only at optimality. 

Proof. This is in fact the first-order optimality condition of ([3]) \12\ [20] . □ 

Proposition 12.11 leads to consider the approximate optimality condition that a local min- 
imum of ([21) is identified with the global minimum of ([H) if ||S||op — A < e where e is a 
user-defined threshold. 



2.2 Duality gap computation 

Proposition 12.11 provides a criterion to check the global optimality of a solution of How- 
ever, it provides no guarantees on closeness to the global solution. A better way of certifying 
closeness for the optimization problem of type (HI) is provided by the duality gap. The duality 
gap characterizes the difference of the obtained solution from the optimal solution and is al- 
ways non- negative [9]. Without loss of generality we introduce a dummy variable Z G M"-^™ 
to rephrase the optimization problem ([T]) as 

minx.z f(^) + M\'^\\* 
subject to Z = X. 

The Lagrangian of the problem with dual variable M G ]^"x»t* written as 

£(X, Z, M) = /(X) + A||Z||* + Trace(M^(Z - X)) 

The Lagrangian dual function g of the Lagrangian C is, then, computed as [9^ 2] 

g{M) = minx,z C{X,Z,M) 
=^ g{M) = minx,z /(X) - TVace(M^X) + Trace(M^Z) + A||Z||* 
=^ g{M) = minx {/(X) - Trace (M^X)} + minz {Trace(M^Z) + A||Z||4 
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For a pair (X, M) the duality gap is then defined as /(X) + A||X||* — (?(M). To compute the 
expression of the duahty gap we need to calculate analytically g and suggest a value for the 
dual variable M. Using the concept of dual norm of trace norm, i.e., operator norm we have 

minz Trace(M^Z) + A||Z||* = if ||M||op < A 

Similarly, using the concept of Fenchel conjugate of a function we have 

minx /(X) - Trace(M^X) = -/*(M) 

where /* is the Fenchel conjugate [Hll] of /, defined as 

/*(M) = supxgRnx^ [TVace(M^X) - /(X)] . 

The final expression for the dual function is, thus, [6] 

g{M) = -f*{M) subject to ||M||op < A. 

The final expression of duality gap is 

/(X) + A||X||,-(7(M) = /(X) + A||X||, + r(M) subject to ||M||op < A (5) 

where M is the dual candidate. A good choice for the dual candidate M is S (= Gradx/(X)) 
with appropriate scaling to satisfy the operator norm constraint: M = min{l, — }S [B]. 

As an extension for some functions / of type /(X) = ip{A(X.)) where ^ is a linear operator, 
computing the Fenchel conjugate of the function ip may be easier than that of /. The duality 
gap, using similar calculations as above, is obtained as 

/(X) + A||X||* + V*(M) subject to P*(M)||op < A 

where A* is the adjoint operator of A and ^* is the Fenchel conjugate of ip. A good choice 
of M is again min{l, ■^}Gia.dip where is the dominant singular value of A*{Giad%lj) [6]. 

3 Manifold-based optimization to solve the non-convex 
problem ([2]) 

In this section we solve the problem ^ to obtain a local minimum. In contrast to first-order 
optimization algorithms proposed earlier in [25l [24l IIH] , we develop a second-order trust-region 
algorithm that has a quadratic rate of convergence |27l [3] . The idea behind a trust-region 
algorithm is to build locally a quadratic model of the function at a point and solve the trust- 
region subproblem to get the next potential iterate. Depending on whether the decrease in 
the objective function is sufficient or not, the potential iterate is accepted or rejected. Details 
about a general trust-region algorithm are given in |27j. Rewriting the problem ([2]) as 



minu,B,v (?!<(U,B,V) 

subject to (U, B, V) G St(p,n) x S++{p) x St(p,m) 



(6) 
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where (^(U, B, V) = /(UBV"^) + ATrace(B) is introduced for notational convenience. An 
important observation for second-order algorithms [21 [3] is that the local minima of the 
problem ^ are not isolated in the search space 

Mp = St{p,n) X S'++(p) X St(p, m). 

This is because the cost function is invariant under rotations, i.e., 

UBV^ = (UO)(O^BO)(VO)^ 

for any p x p rotation matrix O £ 0{p). To remove the symmetry of the cost function, we 
identify all the points of the search space that belong to the equivalence class defined by 

[(U,B,V)] = {(UO,O^BO,VO)|0 G 0{p)}. 

The set of all such equivalence classes is termed the quotient manifold of A4p by 0{p), i.e., 
denoted by 

Mp = Mp/0{p). (7) 

The problem ^ is thus conceptually an unconstrained optimization problem on the quotient 
manifold Aip in which the minima are isolated. Computations are performed in the total 
space Aip, which is the product space of well-studied manifolds. 




[(Uo,Bo,Vor •[(U,B,V)] 
Figure 1: The quotient manifold representation of the search space. 

Tangent space of Aip 

Tangent vectors at a point x G Mp have a matrix representation in the tangent space of 
the total space Aip. Note that x belongs to Aip and its equivalence class is represented 
by the element x G M.p such that x = [x]. Because the total space is a product space 
St{p,n) X S++{p) X St(p, m), its tangent space admits the decomposition at a point x = 
(U,B,V) _ 

TsMp = TuSt(p,n) X TbS++{p) x TvSt{p,m) 
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and the following characterizations are well-known [14^ [3T] 

ruSt(p,n) = {Un + UxK|nG5,A:e«,(p),KGM(«-P)xP} 

= {Zu - USym(U^Zu)|Zu G M"^n 
Tb5'++(p) = Ssymip) 

where Ssym{p) is the set oi p x p symmetric matrices, Sskew{p) is the set of p x p skew- 
symmetric matrices and U_l is the orthogonal complement of the space spanned by U. Note 
that an arbitrary matrix (Zu,Zb,Zv) G M"^^ x W^p x W^^(p) is projected on the tangent 
space TxMp by the linear operation 

Zb, Zv) = (Zu - USym(U'^Zu), Sym(ZB), Zv - VSym(V^Zv)). (8) 

A matrix representation of the tangent space at x G Aip relies on the decomposition of TxMp 
into its vertical and horizontal subspaces. The vertical space Vx-Mp is the subspace of TxM.p 
that is tangent to the equivalence class [x] 

VxMp = {(Un,Bn - fiB, Vn)|n E SskeUp)}- (9) 

The horizontal space TixJ^p must be chosen such that 

TxMp = UxMp e VxMp. (10) 

We choose TixMp as the orthogonal complement of Vx-Mp for the metric 

9x{Cx,r]x) = Trace(^5r/u) + Trace(B-i^BB-^r/B) ^^^^ 
+Trace(^^r/v), 

which picks the normal metric of the Stiefel manifold [14j and the natural metric of the positive 
definite cone |31j . Here (,x and rjx are elements of TxMp. With this choice, a horizontal tangent 
vector C,x is any tangent vector ((^Ui Cb> Cv) belonging to the set 

UxMp = {(Cu,Cb,Cv) GT^A^p|fe((Cu,CB,Cv),(Un,(Br2-r2B),Vn)) =0} (12) 

Starting from an arbitrary tangent vector rjx G TxAip we construct its projection on the 
horizontal space by picking 17 G Sskew{p) such that 

^x{r]x)= (7?u-Un,7?B-(Br2-nB),r/v-Vn) (^n-xMp, (13) 

Using the calculation (I12p . the unique that satisfies (jlSh is the solution of the Sylvester 
equation 

nB2 + B^ri = B(Skew(U^r?u) - 2Skew(B-i7?B) + Skew(V^7?v))B (14) 

where Skew (A) = ■ The numerical complexity of solving the Sylvester equation is 

0(p3). 
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The Riemannian submersion {Aip,g) 

The choice of the metric (jlip . which is invariant along the equivalence class [x], and of the 
horizontal space ()12p turn the quotient manifold Aip into a Riemannian submersion of {Aip, g) 
[5]. As shown in [3], this special construction allows for a convenient matrix representation 
of the gradient and the Hessian on the abstract manifold Mp. The Riemannian gradient of 
(p ■ -Mp — 7- M : a; I— 7- (j){x) = (j){x) is uniquely represented by its horizontal lift in Aip which has 
the matrix representation 

grad^-c/) = Usigmd^c^). 

On the other hand, the matrix expression of gradj;^ at a point x = (U, B,V) is standard: 
the Euclidean gradient (Gradu^, Grade Gradvcp) must simply be projected on T^Aip, i.e., 

grad^cj) = ^'j.(Gradu0, Gradet^, Gradvi^) 

Likewise, the Riemannian connection Vi/?7 on Aip is uniquely represented by its horizontal 
lift in M-p which is 

where u and t] are vector fields in A4p and u and f] are their horizontal lifts in A4p. Once 
again, the Riemannian connection Vpfy on Aip has well-known expression 116 ^ 131 ^ 13]. obtained 
by means of the Koszul formula, given by 

Vpf/ = ^:s(Df/[i>]) - ( i^Sym(U^7?u), Sym(r/BB-ir/B), zA^Sym(V^r?v) ) • (15) 

Here D7y[^'] is the classical Euclidean directional derivative of r/ in the direction u. The 
Riemannian Hessian in Aip has, thus, the following matrix expression 

Hess(/.(x)[e] = lis (y-^^) ■ (16) 
for any G T^Aip and its horizontal lift ^ G TixMp- 

Trust-region subproblem and retraction on Aip 

Analogous to trust-region algorithms in the Euclidean space, trust-region algorithms on a 
quotient manifold with guaranteed quadratic rate convergence have been proposed in [H [3]. 
The trust-region subproblem on M is formulated as 

minggT,A4p 4>{^) +5x(C,grad(/>(x)) + i5a;(^, Hess(/>(x)[^]) 
subject to gx{(,,0 < 

where 5 is the trust-region radius and grade/) and Hesse/) are the Riemannian gradient and 
Hessian on Aip. The problem is horizontally lifted to the horizontal space TixMp where it is 
solved using a truncated- conjugate gradient method with parameters set as in Alg. 2 of [1]. 
Refer [l] for further details. Solving the above trust-region subproblem leads to a direction 
that minimizes the quadratic model. To find the new iterate based on the obtained direction 
^, a mapping from the tangent space T^Mp to the manifold Mp is required. This mapping 
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is more generally referred to as retraction which maps the vectors from the tangent space 
onto the points on the manifold, Rx ■ T^Aip — )■ Aip (details in ^). In the present case, a 
retraction of interest is [31 [25] 

^u(eu) = uf(U + eu) 

^B(eB) = B3exp(B-ieBB-^)Bi (17) 
i?v(ev) = uf(V + ev) 

where uf is a function that extracts the orthogonal factor of the polar factorization and exp 
is the matrix exponential operator. The retraction on the positive definite cone is the natural 
exponential mapping for the metric (jlip [31j . 

Numerical complexity 

The numerical complexity per iteration of the proposed trust-region algorithm to solve ([6]) 
depends on the computational cost of the following components. 

• Objective function (f) — > problem dependent 

• Metric g — > 0{np^ + mp^ + p^) 

• Euclidean gradient of (j) — > problem dependent 

. VpT? = ^'(Df/[i>]) - 1'(z.uSym(U^7?u),Sym(z.BB-i7?B),i^vSym(V^7?v)) 

— D77[i>] — > problem dependent 

— Matrix u\jSym.i\]^ rju) — > 0{np'^) 

— Matrix Sym(z^BB"i??B) — > 0{p^) 

— Matrix z/vSym(V-^r7v) — > 0{mp'^) 

• Projection operator ^' — > 0{np'^ + mp"^) 

• Projection operator 11 — > 0{np^ + mp'^ + p^) 

— Sylvester equation for fl — > 0{p'^) 

• Retraction R — > 0{np'^ + mp'^ + p"^) 

As shown above all the manifold related operations are of linear complexity in n and m. Other 
operations depend on the problem at hand and are computed in the search space Aip. With 
p <^ min{n, m} the computational burden on the algorithm considerably reduces. Similarly, 
sparsity in the problem can also be exploited to compute the components "cheaply". 

4 An optimization scheme to solve convex program ([l]) 

In order to solve the convex program ([T|), we alternate a second-order local optimization 
algorithm on fixed-rank manifold with a first-order rank-one update. The scheme is shown in 
Table [TJ The rank update is constructed in such a way that 
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Algorithm to solve convex problem dT} 
0. • Initialize p to po, a guess rank. 

• Initialize the threshold e for convergence criterion, refer to Proposition 



• Initialize the iterates Uo € St{po,n), Bo £ S++(po) and Vq G St{po,in). 

1. Solve the non-convex problem ^ in the dimension p to obtain a local minimum 
(U,B,V). 

2. Compute cri (the dominant singular value) of dual variable S — 
Gradx/(UBV^). 

• If (7i — A < e (or duality gap < e), output X = UBV'^ as the solution to 
problem and stop. 

• Else, compute the update as shown in Theorem (|4.ip and compute the 
new point (U+,B+, V+) as described in (|19|) . Set p = p + 1 and repeat 
step 1. 



Table 1: Algorithm to solve the trace norm minimization problem of type ([T|). 

1. it ensures a descent property of the cost and 

2. it maps a point (U,B, V) G Mp to (U+,B+, V+) E Mp+i. 
Proposition 4.1. /f X = UBV-^ then the rank-one update 

X+ = X - (3uv'^ 



(18) 



ensures a decrease in the objective function /(X) + A||X||:^ provided that (3 > is sufficiently 
small and the descent directions u G M" and v G are chosen as the dominant left and 
right singular vectors of the dual variable S = Gradx/(UBV-^). 



Proof. This is in fact a descent step as shown in [12^ [201 I21j but now projected onto the 
rank-one dominant subspace. □ 

A representation of X-j. on is obtained from the singular value decomposition of X-|_. 

Since X+ is a rank-one update of UBV"^, the singular value decomposition can be performed 
efficiently [lOJ. Defining u' and v' such that n' = (I — UU-^)n and u' = (I — VV^)(— /3f), 
which are the orthogonal projections of u and v on the complementary space of U and V, 
the update (fTH]) is written as 



X+ = UBV^ - /3mw^ = [U 



u 



B 
1 



[V 



where 



K 



1 U^n 



B 

1 



1 -pW^v 
1 
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It should be noted that K is of size (p + 1) X + 1). If P'S'Q" is the singular value 
decomposition of K where P' and Q' are orthonormal matrices and Xl' is a diagonal matrix 
then the new point X+ G -Mp+i is given by 

U+ = [U ^]P', B+ = 5]' and V+ = [V ^]Q'. (19) 

To compute (3 we perform a backtracking line search. 

Comments 

The algorithm described in Table [1] always converges to the global solution and stops at most 
when p = min{m,n}. In this case we end up solving the original problem itself. However, in 
practice the algorithm stops at a rankp <^ min{m, n} (often p = r where r is the optimal rank) 
as clear from the numerical experiments presented later in section ([U]). This makes it appealing 
for large-scale applications where memory requirements are a major issue, e.g., \17 \ \26 \ [25]. 
One advantage of the scheme, in contrast to trace norm minimization algorithms proposed 
in \12\ [33l [20l [2T] , is that it offers a tight control of the rank at all intermediate iterates of 
the scheme. In singular value thresholding schemes, for each iteration, the (full) Euclidean 
gradient information is proposed necessary for the singular value shrinkage operation [12] that 
thresholds the rank. In contrast, our scheme uses only rank-one information of the gradient 
while incrementing rank and a rank-p information while optimizing on the fixed-rank manifold. 

Another advantage is that the stopping criterion threshold of the non-convex ^ and that 
of the convex ([1]) can be separated. This means that rank-increments can be made after a 
fixed number of iterations of the manifold optimization without waiting for the trust-region 
algorithm to converge to a local minimum. 

5 Regular izat ion path 

In applications the optimal value of A is usually unknown [21j and hence, there is a need to 
solve ([1]) for a number of regularization parameters. In addition, even if the optimal A is a 
priori known, it might be better to build a path of solutions corresponding to different values 
of A that lead to the optimal value. This gives more interpretability to the intermediate 
iterates which are now seen as global minima for different values of A. This motivates the 
interest in an efficient way to compute the full solution (or regularization) path of ([T|) for a 
number of values A, i.e., defined as 

X*(Ai) = argmiuxeRnxm /(X) + Ai||X||* 

where X*(Ai) is the solution to the Aj minimization problem. A common approach is the 
warm-restart approach where the algorithm to solve the Aj+i problem is initialized from 
X*(Aj) and so on [2T]. However, the warm-restart approach does not utilize the fact that 
the regularization path is often (to some degree) smooth especially when the values of A are 
close to each other. In this section we describe a predictor- corrector scheme that takes into 
account the first-order smoothness and computes the path efficiently. To compute the path 
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Computing the regularization path 

0. Given {\i}i=i,...,N in decreasing order. Also given are the solutions X*(Ai) and 
X*(A2) at Ai and A2 respectively and their low-rank factorizations. 

1. Predictor step: 

• If X*(Ai_i) and X*(Ai) belong to the same quotient manifold A4p then 
construct a first-order approximation of the solution path at Ai and esti- 
mate X(Ai+i) as shown in (|2ip . 

• Else X(A,+i) = X*(A.). 

2. Corrector step: Using the estimated solution of the A^+i — problem, initialize 
the algorithm described in Table [T]to compute the exact solution X*(Ai+i). 

3. Repeat steps 1 and 2 for all subsequent values of A. 



Table 2: Algorithm for computing the regularization path. If is the number of values of 
A and r is the number of rank increments then the scheme uses r warm restarts and N — r 
predictor steps to compute the full path. 



we take a predictor (estimator) step to predict the solution and then rectify the prediction 
by a corrector step. This scheme has been widely used in solving differential equations and 
regression problems [28]. We extend the prediction idea to the quotient manifold Aip. The 
corrector step is carried out by initializing the algorithm in Table [1] from the predicted point. 
If X*(Ai) = UjBjVj"^ is the polar factorization then the solution of the Aj+i optimization 
problem is predicted (or estimated), i.e., X(Aj+i) = Uj+iBj+iV^^, by the two previous 
solutions X*(Aj) and X*(Aj_i) at A^ and Ai_i respectively belonging to the same rank manifold 
Mp. When X*(Aj_i) and X*(Aj) belong to different rank manifolds we perform instead a 
warm restart to solve Aj+i problem. The complete scheme is shown in Table [2] and has the 
following advantages. 

• With a few number of rank increments we traverse the entire path. 

• Potentially every iterate of the optimization scheme is now a global solution for a value 
of A. 

• The predictor-corrector approach outperforms the warm-restart approach in maximizing 
prediction accuracy with minimal extra computations. 

In this section, we assume that the optimization problem ([T|) has a unique solution for all A. 
A sufficient condition is that / is strictly convex, which can be enforced by adding a small 
multiple of the square Frobenius norm to /. 

Predictor step on the quotient manifold Aip 

Assuming (first-order) smoothness of the regularization path on Aip connecting (Uj,Bj, Vj) 
and (Uj_i, Bj_i, V.j_i) in Aip, we build a first-order approximation of the geodesic, i.e. 
the curve of shortest length, connecting the two points. The estimated solution X(Ai+i) 
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Figure 2: Tracing the path of solutions using predictor-corrector approach. The blue line 
denotes the curve of optimal solutions. 

is then computed by extending the geodesic. In other words, we need to identify a vector 
C e T[^Ui,Bi,Vi)]-Mp and its horizontal lift ^ G 'H{Ui,Bi,Vi)Mp at (UjjBj, Vj) on Mp defined 
as 

C = Log(Ui,Bi,Vi)(Ui-i, Bi_i, Vi_i) 

that maps (Uj-i, Bj_i, Vj_i) on M.p to the horizontal space T-l(i!j,'Bj,Vj)-^p [3]- Log is 
referred to as logarithmic mapping. Computing the logarithmic mapping (and hence, the 
geodesic) might be numerically costly in general. For the case of interest there is no analytic 
expression for the logarithmic mapping. Instead a numerically efficient way is to use the 
inverse retraction R^^_ g. -y_-j(Uj_i, Bj_i, Vj_i) where R^^ : A4p £ to obtain a direction 
in the space £ followed by projection onto the horizontal space 'W(Uj,Bj,Vj)-^p- Note that 
£ := M'^^P X RP^P X M^-^P. The projection is accomplished using projection operators ^ : 

£ T'(u,,B„v,)-^p n : T'(u,,B,,v,)-^p 'W(U,,B„v,)-^p defined in Section [3l Hence, an 
estimate on ^ is given as 

|=n(^(i?-^^3^^^^)(U,_i,B,_i,V,_i))) (20) 

For the retraction of interest ()17p the Frobenius norm error in the approximation of the 
Logarithmic mapping is bounded as 

lll-eilF = |||-i2^t]„B„v,)(Ui-i,Bi_i,Vi_i) + i?-^^3^^_v^)(Ui„i,Bi_i,Vi_i)-elF 

- -^(Ui,Bi,Vi)(^*"l'-^*~l''^*"l'*ll'^ ll-^(Ui,Bi,Vi)(^«-l'-^»-l''^»-l) ~ 

as UW^O. 

The Odl^llj;') approximation error comes from the fact that the retraction R used is a first- 
order retraction [3]. This approximation is exact if Aip is the Euclidean space. The inverse 
retraction corresponding to the retraction R described in (|17p is computed as 

i?u;(u,_i) = u,_i-Ui 

R^]{B,^i) = Bf log (B-^Bi_iBr5 j Bf 
%l(Vi-i) = V,-i-Vi 
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where log is the matrix logarithm operator. The predicted solution is then obtained by taking 
a step St and performing a backtracking line search in the direction — ^ i.e., 

(U,+i,B,+i, V,+i) = i?(u.,B..vo(-^i)- (21) 

A good choice of the initial step size sj is Y^^nr~^- The motivation for the choice comes 
the observation that it is optimal when the solution path is a straight line in the Euclidean 
space. The numerical complexity to perform the prediction step in the manifold Mp is 

6 Numerical Experiments 

The overall optimization scheme with descent-restart and trust-region algorithm is denoted 
as "Descent-restart -|- TR" (TR). We test the proposed the optimization framework on the 
problems of low-rank matrix completion and multivariate linear regression where trace norm 
penalization has shown efficient recovery. Full regularization paths are constructed with 
optimality certificates. All simulations in this section are performed in MATLAB on a 2.53 
GHz Intel Core 15 machine with 4 GB of RAM. 

6.1 Diagonal versus matrix scaling 

Before entering a detailed numerical experiment we illustrate here the empirical evidence 
that constraining B to be diagonal (as is the case with SVD) is detrimental to optimization. 
To this end, we consider the simplest implementation of a gradient descent algorithm for 
matrix completion problem (see below). The plots shown Figure [3] compare the behavior of 
the same algorithm in the search space St(p, n) x S'++(p) x St(p, m) (polar factorization) and 
St(p, n) X Diag+(p) x St(p, m) (SVD). Diag^(p) is the set of diagonal matrices with positive 
entries. The empirical observation that convergence suffers from imposing diagonalization 
on B is a generic observation that doesn't depend on the particular problem at hand. The 
problem here involves completing a 200 x 200 of rank 5 from 40% of observed entries. A is 
fixed at 10-1°. 

6.2 Low-rank matrix completion 

The problem of matrix completion involves completing an n x m matrix when only a few 
entries of the matrix entries are known. Presented in this way the problem is "ill-posed" but 
becomes considerably interesting when in addition a low-rank reconstruction is also sought. 
Given an incomplete low-rank (but unknown) n x m real matrix X, a convex relaxation of 
the matrix completion problem is 

minxgRnxn. ||W0(X-X)||2, + A||X||, (22) 

for X G i^^-xf" and a regularization parameter A. Here || • \\f denotes the Frobenius norm, 
matrix W is an n x m weight matrix with binary entries and the operator denotes element- 
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Figure 3: Convergence of a gradient descent algorithm is affected by making B diagonal. 



wise multiplication. If A' is the set of known entries in X then, 



1 

otherwise. 



The problem of matrix completion is known to be combinatorially hard. However, by solving 
the convex relaxation (|22p a low-rank reconstruction is possible with a very high probability 
[13^ [Is] under certain assumptions on the number of observed entries. For an exact recon- 
struction, the lower bound on the number of known entries is typically of the order 0{nr+mr) 
where r is the rank of the optimal solution. Consequently, it leads to a very sparse weight 
matrix W, which plays a very crucial role for efficient algorithmic implementations. For our 
case, we assume that the lower bound on the number of entries is met and we seek a solution 
to the optimization problem (j22p . Customizing the terminology for the present problem, the 
convex function is 

/(X) = ||W0(X-X)||2,. 
Using the factorization X = UBV^, the p— ranked formulation is 

(^(U,B,V) = ||W0(X-UBV^)|||, + ATYace(B) 

where (U, B, V) G Mp. The dual variable S = 2(W (UBV'^ - X)). The gradient of (j) in 
£ can be computed as 

Gradu(^ = SVB, Graded = U^SV + AI and Gradv^ = S^UB. 

The directional derivative of the gradient of cj) along Z = (Zu, Zb, Zv) £ £ in the space £ is 



/ SVZb + SZvB + S,VB, Z(jSV + USZv + V, \ 
\ S^UZb + S^ZuB + S^UB, ) 

where S* = D(u,b,v)S[Z] = 2(W0(ZuBV^-FUZBV'^-hUBZ^)) is the directional derivative 
of S along Z. The Riemannian gradient and Hessian are computed using formulae developed 
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in Section [3l Note that since W is sparse, S and S* are sparse too. As a consequence, the 
numerical complexity of evaluating objects needed in the trust-region algorithm is of order 
0{\X\p + np'^ + mp'^ ^p"^) where \X\ is the number of known entries. Asymptotically, the total 
(problem dependent and manifold related) numerical complexity per iteration for solving the 
problem ([2]) using the trust-region algorithm proposed earlier is of order 0(|^Y|p). For using 
the optimization scheme described in Tabled! an additional thin singular value decomposition 
of dual variable S is performed. Due to sparsity in S, it can be computed with numerical 
complexity of order Od^"!). The linear complexity with respect to the number of known 
entries allows us to handle potentially very large datasets. 

Fenchel dual and duality gap for matrix completion 

For the matrix completion problem, the sampling operation is the linear operator ^(X) = 
W X. We can, therefore, define a new function ip such that /(X) = iJj{W X). The dual 
candidate M is defined by 

M = min(l, — )Grad'i/' 

where Grad'0(W X) = 2(W X — W X) and is the dominant singular value of 
A*{Grad'ip) (refer Section 12.21 for details). In matrix form, ^*(Grad^) can be written as 
W Grad^. Finally, the Fenchel dual ip* at a dual candidate M can be computed as 

V'*(M) = supzgK"x™ Trace(M^Z) - ||W0X- Z||| (23) 

where Z is in the domain of tp. The maximum of ()23p is obtained at 

Z* = i(M-F2W0X) 

and so, 

V'*(M) = Ti^ace(M^Z) - ||W0X- Z||2, 

= Trace (M^M)/4-FTr ace (M^(W0X)) 

The final expression for the duality gap at a point X and a dual candidate M = min(l, ^)GradV' 
is 

/(X) + A||X||* + Trace(M^M)/4 + TYace(M^(W X)). (24) 
In terms of the polar factorization X = UBV"^, the duality gap expression simplifies to 

/(UBV^) + ATrace(B) + Trace(M'^M)/4 + Trace(M^(W X)). 

The above expression shows that the numerical cost to compute the duality gap is 0(1*^1) 
which makes the computation numerically tractable. 

6.2.1 An example 

A 100 X 100 random matrix of rank 10 is generated according to a Gaussian distribution with 
zero mean and unit standard deviation. 20% of the entries are randomly removed with uniform 
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probability. To reconstruct the original matrix we run the optimization scheme proposed in 
the Table [T] along with the trust-region algorithm to solve the non-convex problem. For 
illustration purposes A is fixed at 1 x 10^^. We also assume that we do not have any a 
priori knowledge of the optimal rank and, thus, start from rank 1. The trust-region algorithm 
stops when the relative or absolute variation of the cost function falls below 1 x 10"^*^. The 
rank- incrementing strategy stops when relative duality gap is less than 1 x 10~^, i.e., 

/(X) + A||X|| +V.-(M)^^^ 

|V'*(M)| 

respectively. Convergence plots of the scheme are shown in Figure HI A good way to charac- 
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Figure 4: Matrix completion by trace norm minimization algorithm with A = 1 x 10^^. Upper 
left: Rank incremental strategy with descent directions. Upper right: Optimality certificate 
of the solution with duality gap. Lower left: Convergence to the global solution according to 
Proposition 12.11 . Lower right: Recovery of the original low-rank matrix. 

terize matrix reconstruction is to look at the relative error of reconstruction, defined as, 

Rel. error of reconstruction = ||X — X*||i?/||X||i? 

where X* is the output of the trace norm minimization algorithm. Next, to understand 
low-rank matrix reconstruction by trace norm minimization we repeat the experiment for 
a number of values of A all initialized from the same starting point and report the relative 
reconstruction error in Table [3] averaged over 5 runs. This, indeed, confirms that matrix 
reconstruction is possible by solving the trace norm minimization problem (|22p . 
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A 


10 


10"^ 


10"^ 


10"* 


Rel. reconstruction error 


6.33 X 10"^ 


7.42 X 10"^ 


7.11 X 10~* 


6.89 X 10"" 


Recovered rank 


10 


10 


10 


10 



Table 3: Efficacy of trace norm penalization to reconstruct low-rank matrices by solving (|22p . 

6.2.2 Regularization path for matrix completion 

In order to compute the entire regularization path, we employ the predictor-corrector ap- 
proach described in Table [5] to find solutions for a grid of A values. For the purpose of 
illustration, a geometric sequence of A values is created with the maximum value fixed at 
Ai = 1 X 10^, the minimum value is set atA7v = lxlO ^ and a reduction factor 7 = 0.95 such 
that Aj+i = 7Ai. Similar to the earlier example, a 100 x 100 matrix of rank 10 is generated 
under the standard assumptions. In this case we remove 20% of the entries with uniform 
probability. The algorithm for a Aj G {Ai,...,AAr} stops when the relative duality gap falls 
below 1 X 10"^. Various plots are shown in Figure [5l Figure [5] also demonstrates the advan- 
tage of the scheme in Table [2] with respect to a warm-restart approach. We compare both 
approaches on the basis of 

Inaccuracy in prediction = (^(X(Ai)) — 0(X*(Aj)) (25) 

where X*(Aj) is the global solution at Aj and X(Aj) is the predicted (or estimated) solution. 
A lower inaccuracy means better prediction. It should be emphasized that in Figure H] most of 
the points on the curve of the objective function have no other utility than being intermediate 
iterates towards the global solution of the algorithm. In contrast all the points of the curve 
of optimal cost values in Figure [5] are now global minima for different values of A. 

6.2.3 Competing methods for matrix completion 

In this section, we analyze the following state-of-the-art algorithms for low-rank matrix com- 
pletion, namely, 

1. SVT algorithm by Cai et al. [I2] 

2. FPCA algorithm by Ma et al. [20j 

3. SOFT-IMPUTE (Soft-I) algorithm by Mazumder et al. [H] 

While FPCA and SOFT-IMPUTE solve ([22]), the iterates of SVT converge towards a solution 
of the optimization problem 

min ''"ll^ll* + ^II^IIf 

subject to W0X = W0X 

where r is a regularization parameter. For simulation studies we use the MATLAB codes sup- 
plied on the authors' webpages for SVT and FPCA. Due to simphcity of the SOFT-IMPUTE 
algorithm we use our own MATLAB implementation. The numerically expensive step in all 
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Figure 5: Computation of entire regularization path using Descent-restart + TR with a 
predictor-corrector approach. Upper left: Recovery of ah ranks optimal solutions. Upper 
right: Optimality certificate for the regularization path. Lower left: Path traced by the 
algorithm. Lower right: Better prediction by the algorithm in Table |2] than a pure warm- 
restart approach. Table: Number of iterations per value of A is < 3. 



these algorithms is the computation of the singular value thresholding operation. To reduce 
the computational burden FPCA uses a linear time approximate singular value decomposi- 
tion (SVD). Similarly, the authors of SVT and SOFT-IMPUTE implement the thresholding 
operation by utilizing the sparse + low-rank structure of the iterates. Specifically, SVT and 
SOFT-IMPUTE use PRO PACK which exploit sparsity of the data while computing singular 
values pSj. It should be emphasized that the performance of SOFT-IMPUTE greatly varies 
with the singular values computation at each iteration. For our simulations we compute 20 
dominant singular values at each iteration of SOFT-IMPUTE. 

Convergence behavior with varying A 

In this section we analyze the algorithms FPCA, SOFT-IMPUTE and Descent-restart -|- TR 
in terms of their ability to solve (I22p as A is varied. SVT is not used for this test since it 
optimizes a different cost function. We plot the objective function /(X) -|- A||X||* against the 
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number of iterations for a number of A values. 

We generate a 100 x 100 random matrix of rank 8 under standard assumptions. 80% of 
the entries are observed. The algorithms Descent-restart + TR, FPCA and SOFT-IMPUTE 
are initialized from the same point. The algorithms are stopped when either the variation or 
relative variation of /(X) -|- A||X||* is less than 1 x 10~^^. The maximum number of iterations 
is set at 500. The rank incrementing procedure of our algorithm is stopped when the relative 
duality gap falls below 1 x 10~^. 
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Figure 6: Convergence behavior of different algorithms with different A values. 



The plots are shown in Figure [6j Descent-restart -|- TR outperforms the others in terms of 
accuracy of solutions obtained with minimal number of iterations. The convergence behavior 
of FPCA is greatly affected by A. It has a slow convergence for a small A while for a larger 
A, the algorithm fluctuates. SOFT-IMPUTE has a nice convergence in all the three cases, 
however, the convergence suffers when a more accurate solution is sought. 



Convergence test 

To understand the convergence behavior of different algorithms involving different optimiza- 
tion problems, we look at the evolution of the training error [12\ [2T] defined as 

Training error = ||W (X - X)|||^, 

with iterations. We generate a 150 x 300 random matrix of rank 10 under standard assump- 
tions. Only 50% of the entries are observed. The algorithms Descent-restart -|- TR, FPCA 
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and SOFT-IMPUTE (Soft-I) are initialized from the same iterate with a fixed A. We fix 
A = 1 X 10^^ as it gives a good reconstruction to compare algorithms. For SVT we use the 
initialization including r = and a step size of ^ as suggested in the paper [12j where 

/ is the fraction of known entries. The algorithms are stopped when the variation or relative 
variation of Training error is less than 1 x 10^^°. The maximum number of iterations is set 
at 500. The rank incrementing procedure of our algorithm is stopped when either the duality 
gap falls below 1 x 10~^ or the relative duality gap falls below 1 x 10~^. 

SOFT-IMPUTE initially has a very fast convergence (for iterations less than 60) but the 
performance slows down later. Consequently, it exceeds the maximum limit of iterations. 
Descent-restart -|- TR, SVT and FPCA, on the other hand, seem to be better equipped 
for targeting more accurate solutions. Descent-restart -|- TR takes the smallest number of 
iterations of all. 
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Figure 7: Convergence behavior of different algorithms in terms of minimizing Training er- 
ror. Left: Faster convergence of Descent-restart and FPCA. Table: Timings and number of 
iterations averaged over 5 runs. 



Scaling test 

To analyze the scalability of these algorithms to larger problems we perform a test where we 
vary the problem size n from 200 to 2200. For each n, we generate a random matrix of size 
n X n of rank 10 under standard assumptions. Each entry is observed with uniform probability 
of f = ^^^°gio(^) [18]. The initiahzations are chosen as in the earlier example. We note the 
time and number of iterations taken by the algorithms until a stopping criterion is satisfied 
or when the number of iterations exceed 500. The stopping criterion is same as the one used 
before for comparison. Results averaged over 5 runs are shown in Figure [8l 

Comments on matrix completion algorithms 

We summarize our observations in the following points. 

• The convergence rate of SOFT-IMPUTE is greatly dependent on the computation of 
singular values. For large scale problems this is a bottleneck and the performance is 
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Figure 8: Analysis of the algorithms on randomly generated datasets of rank 10 with varying 
fractions of missing entries. The reduced performance of SOFT-IMPUTE is due to its slower 
convergence. SVT, FPCA and Descent-restart -|- TR have similar performances but Descent- 
restart -|- TR usually outperforms others. 



greatly affected. However, in our experiments, it performs quite well within a reasonable 
accuracy as seen in Figure [6] and Figure [71 

• SVT, in general, performs quite well on random examples. The choice of the fixed 
step size and regularization parameter r, however, affect the convergence speed of the 
algorithm pOllST]. 

• FPCA has a superior numerical complexity per iteration owing to an approximate sin- 
gular value decomposition |20j . But the performance suffers as the regularization pa- 
rameter A is increased as shown in Figure [6l 

• In all the simulation studies on random examples Descent-restart+TR has shown better 
performance than others on different benchmarks and it competes effectively with the 
state-of-the-art. 



6.3 Multivariate linear regression 

Given matrices Y G R"^'^ (response space) and X G M"^"^ (input data space), we seek to 
learn a weight/coefficient matrix W G M'^^'^ that minimizes the loss between Y and XW |37j . 
Here n is the number of observations, q is the number of predictors and k is the number of 
responses. One popular approach to multivariate linear regression problem is by minimizing 
a quadratic loss function. Note that in various applications responses are related and may 
therefore, be represented with much fewer coefficients. From an optimization point to view 
this corresponds to finding a low-rank coefficient matrix. The papers [3313]) thus, motivate 
the use of trace norm regularization in the following optimization problem formulation, defined 
as, 

minYvgR<?xfc ||Y - XW|||^ -I- A||W||*. 
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(Optimization variable is W.) Although the focus here is on the quadratic loss function, the 
framework can be applied to other smooth loss functions as well. Other than the difference 
in the dual variable S and S*, the rest of the computation of gradient and its directional 
derivative in the Euclidean space is similar to that of the low-rank matrix completion case. 

S = 2(X^XW - X^Y) and 

S* = D(u3,v)S[Z] = 2(X^X(ZuBV^ + UZbV^ + UBZ^)) 
where W = UBV^. 

The numerical complexity per iteration is dominated by the numerical cost to compute 
^(U, B, V), S and terms like SVB. The cost of computing ^ is of 0{nqp + nkp + kp^ + nk) 
and SVB is 0{q^p + qkp + kp^). And that of fuh matrix S is 0{q^p + qkp + kp"^). Prom a 
cubic numerical complexity of 0{q^k) per iteration (using the full matrix W ) the low-rank 
factorization reduces the numerical complexity to 0{q^p + qkp) which is quadratic. Note that 
the numerical complexity per iteration is linear in n. 

Fenchel dual and duality gap for multivariate linear regression 

For the multivariate linear regression problem ^(W) = XW and therefore, we can define tp 
such that /(W) = V'(XW). So, A*{r]) = X'^ry. The dual candidate M is defined by 

M = min(l, — )Grad'0 

(Tip 

where GradV'(XW) = 2(XW — Y) and is the dominant singular value of A* (Giadil)) = 
X-^GradV'. The Fenchel dual V'* (after few more steps) can be computed as 

^*(M) = Trace(M^M)/4 + Trace(M'^Y). 

Finally, the duality gap is computed as /(W) -|- A||W||* -|- ^*(M). As we use a low-rank 
factorization of W, i.e., W = UBV-^ the numerical complexity of finding the duality gap is 
dominated by numerical cost of computing ■0*(M) which is also of the order of the cost of 
computing 0(U,B, V). 

• M ^ order 0{nqp + nkp + kp^) 

• V*(M) order 0{nk). 

6.3.1 Regular izat ion path for multivariate linear regression 

An input data matrix X of size 5000 x 120 is randomly generated according to a Gaussian 
distribution with zero mean and unit standard deviation. The response matrix Y is computed 
as XW* where W* is a randomly generated coefficient matrix of rank 5 matrix and size 
120 X 100. We randomly split the observations as well as responses into training and testing 
datasets in the ratio 70/30 resulting in Ytrain/Ytest and Xtrain/Xtest ■ A Gaussian white noise 
of zero mean and variance cr^oisg is added to the training response matrix Ytrain resulting in 
Ynoise- We learn the coefficient matrix W by minimizing the scaled cost function, i.e., 

min-yYgjggxfc — rll"^noise ~ Xtrain'WII^ + -^IIWH*, 
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where A is a regularization parameter. We validate the learning by computing the root mean 
square error (rmse) defined as 
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■ test 



XtestWl 



ntestk 

where ntest is the number of test observations. Similarly, the signal to noise ratio (SNR) is 
defined as 



IV . I|2 
I ^ tram II ^ 

noise 

We compute the entire regularization path for four different SNR values. The maximum 
value of A is fixed at 10 and the minimum value is set at 1 x 10^^ with the reduction factor 
7 = 0.95. Apart from this we also put the restriction that we only fit ranks less than 30. The 
solution to an optimization problem for Xj is claimed to have been obtained when either the 
duality gap falls below 1 x 10^^ or the relative duality gap falls below 1 x 10^^ or cji — A 
is less than 1 x 10"'^. Similarly, the trust-region algorithm stops when relative or absolute 
variation of the cost function falls below 1 x 10^^*^. The results are summarized in Figure [H 
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Figure 9: Regularization path for multivariate linear regression with various SNR values. 
Results are averaged over 5 random 70/30 splits. 



7 Conclusion 



Three main ideas have been presented in this paper. First, we have given a framework to 
solve a general trace norm minimization problem ([1]) with a sequence of increasing but fixed- 
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rank non-convex problems ([2]). We have analyzed the convergence criterion and duality gap 
which are used to monitor convergence to a solution of the original problem. The duality gap 
expression was shown numerically tractable even for large problems thanks to the specific 
choice of the low-rank parametrization. We have also given a way of incrementing the rank 
while simultaneously ensuring a decrease of the cost function. This may be termed as a 
descent-restart approach. The second contribution of the paper is to present a second-order 
trust-region algorithm for a general p — rank (fixed-rank) optimization in the quotient search 
space St(j), n) x 5++(p) x St{p, m)/0{p) equipped with the natural metric g pT]). The search 
space with the metric g has the structure of a Riemannian submersion [3]. We have used 
manifold-optimization techniques [3] to derive the required geometric objects in order to 
devise a second-order algorithm. With a proper parameter tuning the proposed trust-region 
algorithm guarantees a quadratic rate of convergence. The third contribution of the paper 
is to develop a predictor-corrector algorithm on the quotient manifold where the predictor 
step is along the first-order approximation of the geodesic. The corrector step is achieved by 
initializing the descent-restart approach from the predicted point. The resulting performance 
is superior to the warm-restart approach. 

These ideas have been applied to the problems of low-rank matrix completion and multi- 
variate linear regression leading to encouraging numerical results. 
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